from IPython.display import Image
Image(filename = 'Business Case.jpg')
Problem Statement: Reduce the losses that logistics companies incur because of aggressive drivers. Aggressive driving has many contributing factors, such as weather conditions, lack of sleep, and poorly maintained vehicles. Breaks must be respected as important resting time for drivers; on the other hand, unscheduled or long breaks can disrupt the schedule and result in unhappy customers. Wear and tear can be caused by multiple factors, from aggressive driving to an inefficient maintenance system. Natural disasters and extreme weather conditions can further increase a company's losses.
Until now there was no method to identify aggressive drivers while they were en route for a delivery. With machine learning techniques, a driver can now be classified as aggressive, normal, or vague. The data comes in three files: driver data, vehicle data, and weather data. Combining them gives better insight into the factors behind a driver's aggressiveness.
Importing only the libraries required for the visualization
import matplotlib.pyplot as plt
import pandas as pd
from numpy import *
import numpy as np
import seaborn as sns
import plotly
import plotly.offline as pyoff
import plotly.figure_factory as ff
from plotly.offline import init_notebook_mode, iplot, plot
import plotly.graph_objs as go
%matplotlib inline
from tqdm import tqdm_notebook
import random
import math
import pandas_profiling
import warnings
warnings.filterwarnings('ignore')
init_notebook_mode(connected=True)
Missing values are not always encoded as 'NA'; they can also appear as a blank space, a question mark, or a similar placeholder.
Passing na_values=[" ",".","NA","?","-",""] to read_csv lets us list every marker that should be treated as missing.
train_data = pd.read_csv('Train.csv',na_values=[" ",".","NA","?","-",""])
train_vehicle = pd.read_csv('Train_Vehicletravellingdata.csv',na_values=[" ",".","NA","?","-",""])
train_weather = pd.read_csv('Train_WeatherData.csv',na_values=[" ",".","NA","?","-",""])
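To see what the na_values list does, here is a minimal sketch on a made-up three-row CSV. The column values are hypothetical and only illustrate how "?" and "-" end up as NaN; the real files may use other markers.

```python
import io
import pandas as pd

# Hypothetical CSV content: "?" and "-" stand in for missing readings.
raw = io.StringIO("ID,V2,V5\n1,450,?\n2,-,1200\n3,510,1350\n")

# Same list of markers as used for the real files above.
na_markers = [" ", ".", "NA", "?", "-", ""]
sample = pd.read_csv(raw, na_values=na_markers)

# Each placeholder is now counted as a null value.
missing_per_column = sample.isnull().sum()
print(missing_per_column)
```

Because the placeholders become NaN at load time, the affected columns are parsed as numeric instead of falling back to the object dtype.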
First we take the Train.csv file and make basic observations such as info, shape, and the count of null values.
We will also rename the columns for easier interpretation.
Let's take one dataset at a time: Train data
ID: "ID", V2: "vehicle_length_cm", V5: "vehicle_weight_kg", V6: "number_of_axles", DrivingStyle: "DrivingStyle" (target)
print(train_data.shape)
train_data.isnull().sum()
train_data.columns = ["ID","vehicle_length_cm","vehicle_weight_kg","number_of_axles","DrivingStyle"]
train_data.info()
From the above output it is evident that there are no null values.
Let's take the Second dataset: Vehicle data
We will perform the same operations as above
train_vehicle.columns = ["ID","trip_datetime","lane_no","vehicle_speed","pvehicle_id","pvehicle_speed_kph","pvehicle_weight_kg","pvehicle_length_cm","pvehicle_timegap","weather_road_cond"]
Column pvehicle_timegap has 2455 null values
# checking null values in train_vehicle
print(train_vehicle.shape)
train_vehicle.isnull().sum()
For pvehicle_timegap we will do a mean imputation. The mean is a floating-point value, so after filling we cast the column to int64; note that this cast truncates the fractional part rather than rounding to the nearest integer.
train_vehicle['pvehicle_timegap'] = train_vehicle['pvehicle_timegap'].fillna(train_vehicle['pvehicle_timegap'].mean())
train_vehicle['pvehicle_timegap'] = train_vehicle['pvehicle_timegap'].astype('int64')
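The cast to int64 used above drops the fractional part of the imputed mean. The sketch below, on made-up time gaps, shows how that differs from rounding to the nearest integer; which of the two is preferable for this dataset is a judgment call.

```python
import pandas as pd

# Hypothetical time gaps with one missing value.
gaps = pd.Series([1.0, 2.0, 5.0, None])

mean_gap = gaps.mean()             # mean of 1, 2 and 5, roughly 2.67
filled = gaps.fillna(mean_gap)

truncated = filled.astype("int64")          # drops the fractional part
rounded = filled.round().astype("int64")    # nearest integer instead

print(truncated.tolist(), rounded.tolist())
```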
We will also change the datatype of trip_datetime to datetime64 so that we can use that in future as and when required
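# pd.to_datetime is used because recent pandas versions reject the unit-less 'datetime64' dtype in astype
train_vehicle['trip_datetime'] = pd.to_datetime(train_vehicle['trip_datetime'])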
train_vehicle.info()
Let's take the third dataset: Weather data
We will perform the same operations here as well
train_weather.columns = ["ID",'trip_datetime','air_temp','prep_type','prep_intensity','realtive_humidity','wind_direction','wind_speed_ms','daylight_cond']
train_weather.info()
# checking null values in train_weather
print(train_weather.shape)
train_weather.isnull().sum()
We use mode imputation for prep_intensity. For air_temp, realtive_humidity and wind_speed_ms we use mean imputation and bring the mean to a neighbouring integer with the ceil and floor functions; wind_direction is filled with a constant 180.
train_weather['air_temp'] = train_weather['air_temp'].fillna(math.ceil(train_weather['air_temp'].mean())) #ceil
train_weather['realtive_humidity'] = train_weather['realtive_humidity'].fillna(math.ceil(train_weather['realtive_humidity'].mean()))
train_weather['wind_direction'] = train_weather['wind_direction'].fillna(180) # constant fill, not a mean
train_weather['wind_speed_ms'] = train_weather['wind_speed_ms'].fillna(math.floor(train_weather['wind_speed_ms'].mean()))
train_weather['prep_intensity'] = train_weather['prep_intensity'].astype('category')
# train_weather['prep_intensity'].mode()
train_weather['prep_intensity'] = train_weather['prep_intensity'].fillna(train_weather['prep_intensity'].mode()[0])
train_weather.isnull().sum()
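# pd.to_datetime is used because recent pandas versions reject the unit-less 'datetime64' dtype in astype
train_weather['trip_datetime'] = pd.to_datetime(train_weather['trip_datetime'])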
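The two imputation styles used for the weather data can be tried on a tiny made-up frame. The values and category names below are hypothetical; only the column names follow the notebook.

```python
import math
import pandas as pd

# Hypothetical weather readings with one missing value per column.
weather = pd.DataFrame({
    "air_temp": [1.0, 2.0, None, 4.5],
    "prep_intensity": ["none", "low", None, "none"],
})

# Numeric column: fill with the mean, rounded up to an integer.
temp_fill = math.ceil(weather["air_temp"].mean())   # mean of 1, 2, 4.5 is 2.5
weather["air_temp"] = weather["air_temp"].fillna(temp_fill)

# Categorical column: fill with the most frequent value.
mode_fill = weather["prep_intensity"].mode()[0]
weather["prep_intensity"] = weather["prep_intensity"].fillna(mode_fill)

print(weather)
```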
The basic preprocessing finishes here. Now we join all the datasets using inner joins.
The inner join between the train data and the vehicle data is done on the 'ID' column.
The inner join between the joined data and the weather data is done on the 'ID' and 'trip_datetime' columns.
data = pd.merge(train_data,train_vehicle,on='ID',how='inner')
print(data.shape)
data = pd.merge(data,train_weather,on=['ID','trip_datetime'],how='inner')
print(data.shape)
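An inner join silently drops rows that have no match on the other side, so it can be worth auditing what is lost. The sketch below uses two made-up frames and the indicator option of pd.merge; the frames and their values are hypothetical.

```python
import pandas as pd

# Hypothetical stand-ins for the driver and vehicle tables.
left = pd.DataFrame({"ID": [1, 2, 3], "DrivingStyle": [1, 2, 3]})
right = pd.DataFrame({"ID": [1, 1, 2, 4], "vehicle_speed": [80, 85, 90, 70]})

# Inner join keeps only IDs present in both tables.
inner = pd.merge(left, right, on="ID", how="inner")

# Outer join with indicator=True labels each row's origin in a "_merge" column.
audit = pd.merge(left, right, on="ID", how="outer", indicator=True)
unmatched = audit[audit["_merge"] != "both"]

print(inner.shape)
print(unmatched[["ID", "_merge"]])
```

Comparing the shape printed after each join with the shapes of the input tables gives the same information in aggregate; the indicator column shows which IDs are affected.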
# for efficient memory usage
del train_data
del train_vehicle
del train_weather
converting all the object data type columns to categories
for column in data.columns:
    if data[column].dtypes == 'object':
        data[column] = data[column].astype('category')
checking the summary statistics for all the columns
data.describe(include='all').T
Checking the correlation of all the columns with each other
f, ax = plt.subplots(figsize=(13,9))
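# numeric_only=True restricts the correlation to numeric columns (needed in recent pandas versions)
sns.heatmap(data.corr(numeric_only=True),annot=True)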
From the above graph it is evident that most features are only weakly correlated with each other, so they carry largely non-redundant information. The relationship between vehicle_length_cm and pvehicle_length_cm is worth checking further.
We'll further dissect the trip_datetime column into the following parts: Day, Month, Year, Hour, Minute, Second
data['trip_date'] = data['trip_datetime'].dt.date
data['trip_date_day'] = data['trip_datetime'].dt.day
data['trip_month'] = data['trip_datetime'].dt.month
data['trip_year'] = data['trip_datetime'].dt.year
data['trip_time'] = data['trip_datetime'].dt.time
data['trip_hour'] = data['trip_datetime'].dt.hour
data['trip_minute'] = data['trip_datetime'].dt.minute
data['trip_sec'] = data['trip_datetime'].dt.second
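As a quick check of what the .dt accessor returns, here is a small sketch on two made-up timestamps. The dates are hypothetical; the column names mirror the ones created above.

```python
import pandas as pd

# Two hypothetical trip timestamps.
stamps = pd.to_datetime(pd.Series(["2012-03-05 14:07:30", "2012-11-21 06:45:00"]))

# Each .dt attribute returns one integer component per row.
parts = pd.DataFrame({
    "trip_date_day": stamps.dt.day,
    "trip_month": stamps.dt.month,
    "trip_hour": stamps.dt.hour,
    "trip_minute": stamps.dt.minute,
})
print(parts)
```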
Calculating the speed difference between the current vehicle and the preceding vehicle
data['speed_difference'] = data['vehicle_speed'] - data['pvehicle_speed_kph']
Since we have the speed of the preceding vehicle and the time gap to it, we can use the formula below, rearranged to distance = speed * time, to estimate the distance between the current and the preceding vehicle. The speed is divided by 3.6 to convert it from km/h to m/s.
speed = distance / time
data['distance'] = (data['pvehicle_speed_kph']/3.6)*data['pvehicle_timegap']
data['distance'] = data['distance'].astype('int64')
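A worked example of the distance calculation, written as a small helper. The function name is hypothetical, and it assumes the time gap is recorded in seconds, which the source data does not state explicitly; if the gap uses another unit, the result scales accordingly.

```python
def following_distance_m(speed_kph, time_gap_s):
    """Distance in metres covered at speed_kph during time_gap_s seconds.

    Assumes the time gap is in seconds (an assumption, not confirmed by the data).
    """
    speed_ms = speed_kph / 3.6      # km/h -> m/s
    return speed_ms * time_gap_s    # distance = speed * time


# A preceding vehicle at 90 km/h with a 2 second gap is about 50 m ahead.
example = following_distance_m(90, 2)
print(round(example, 1))
```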
# calculating the bin size
bins = list(range(0,data['distance'].max()+10,18613))
labels = ['Low','Medium','High']
data['binned_distance'] = pd.cut(data['distance'], bins=bins, labels=labels)
data['binned_distance'].value_counts()
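The hard-coded step above depends on the maximum distance in this particular dataset. As an alternative sketch on made-up distances, pd.cut can derive three equal-width bins itself; it also shows that explicit edges starting at 0 leave a distance of exactly 0 unbinned unless include_lowest=True is passed.

```python
import pandas as pd

labels = ["Low", "Medium", "High"]
distance = pd.Series([0, 10, 25, 40, 60])   # hypothetical distances

# Let pandas compute three equal-width bins from the data range.
auto_binned = pd.cut(distance, bins=3, labels=labels)

# Explicit edges: intervals are right-closed, so the value 0 falls outside (0, 20].
explicit = pd.cut(distance, bins=[0, 20, 40, 60], labels=labels)

# include_lowest=True closes the first interval on the left as well.
explicit_fixed = pd.cut(distance, bins=[0, 20, 40, 60], labels=labels,
                        include_lowest=True)

print(auto_binned.tolist())
print(explicit.isnull().sum(), explicit_fixed.isnull().sum())
```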
data.columns
Designing a custom color palette for the plotly graphs
N = 30
c = ['hsl('+str(h)+',50%'+',50%)' for h in np.linspace(0, 360, N)]
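When a random colour is drawn from this palette, the index has to stay within 0 to N-1. The sketch below builds a similar palette and picks a colour with NumPy's Generator API; the variable names and the rounded hue formatting are illustrative choices, not part of the original notebook.

```python
import numpy as np

N = 30
# N evenly spaced hues around the colour wheel, same saturation and lightness.
palette = ["hsl({:.0f},50%,50%)".format(h) for h in np.linspace(0, 360, N)]

rng = np.random.default_rng(0)   # seeded only to make the sketch repeatable
idx = int(rng.integers(N))       # integers(N) draws from 0..N-1
colour = palette[idx]
print(idx, colour)
```

An index drawn from 1..N inclusive would occasionally hit N, which is one past the end of the list.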
Image(filename = 'QnA_header.jpg')
Each visualization answered several of my questions.
Visualizing the categorical columns
def barplot(temp,column_name):
    # temp is a Series of counts indexed by category
    plot_data = [go.Bar(
        x= temp.index,
        y= temp.values,
        text = temp.values,
        textposition = 'auto',
        # randint(N) returns 0..N-1, always a valid index into the palette c
        marker = dict(color=c[np.random.randint(N)])
    )]
    layout = go.Layout(
        autosize=True,
        title = "Distribution of"+" "+column_name,
    )
    fig = go.Figure(data=plot_data, layout=layout)
    iplot(fig)
bar_graph_columns = ['number_of_axles','DrivingStyle','weather_road_cond']
for column in bar_graph_columns:
    barplot(data[column].value_counts(),column)
def boxplot(temp,column_name):
    plot_data = [go.Box(
        # plot the column that was passed in, not a fixed column
        y = temp,
        name = column_name,
        # randint(N) returns 0..N-1, always a valid index into the palette c
        marker = dict(color=c[np.random.randint(N)]))]
    layout = go.Layout(title = "Boxplot of"+" "+column_name)
    fig = go.Figure(data= plot_data, layout=layout)
    iplot(fig)
boxplot_columns = ['vehicle_length_cm','vehicle_weight_kg', 'pvehicle_weight_kg', 'pvehicle_length_cm']
for column in boxplot_columns:
    temp = data[column]
    boxplot(temp,column)
# Number of axles vs Driving style
plot_data = []
temp = data.groupby(['DrivingStyle','number_of_axles']).size().to_frame()
temp = temp.reset_index()
temp.columns = ['DrivingStyle','number_of_axles','Count']
# temp
temp['number_of_axles'].value_counts()
for i in np.sort(temp['number_of_axles'].unique()):
    trace = go.Bar(x = temp.DrivingStyle[temp.number_of_axles==i],
                   y = temp.Count[temp.number_of_axles==i],
                   text = temp.Count[temp.number_of_axles==i],
                   textposition = 'auto',
                   name = str(i))
    plot_data.append(trace)
layout = go.Layout(title = 'Driving Style vs Number of axles',
yaxis = dict(title='Count'),
xaxis = dict(title='Driving Style'),
barmode='stack')
fig = go.Figure(data=plot_data, layout=layout)
iplot(fig)
On the x-axis we have the driving style and on the y-axis the count. The graph gives more insight into aggressive drivers: aggressive driving appears to be independent of the vehicle's number of axles.
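Raw counts can hide differences when the groups have different sizes, so the independence reading is easier to check with row-wise shares. The sketch below uses pd.crosstab on a made-up frame; the values are hypothetical and only the column names follow the notebook.

```python
import pandas as pd

# Hypothetical records: one row per observed vehicle.
toy = pd.DataFrame({
    "DrivingStyle":    [1, 1, 2, 2, 2, 3, 3, 3],
    "number_of_axles": [2, 3, 2, 2, 3, 2, 2, 3],
})

# Counts per (DrivingStyle, number_of_axles), the same numbers the stacked bars show.
counts = pd.crosstab(toy["DrivingStyle"], toy["number_of_axles"])

# Share of each axle count within a driving style; every row sums to 1.
shares = pd.crosstab(toy["DrivingStyle"], toy["number_of_axles"], normalize="index")

print(counts)
print(shares.round(2))
```

If the rows of the shares table look alike across driving styles, that supports the reading that driving style does not depend on the axle count.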
plot_data = []
temp = data.groupby(['DrivingStyle','weather_road_cond']).size().to_frame()
temp = temp.reset_index()
temp.columns = ['DrivingStyle','weather_road_cond','Count']
# temp
temp['weather_road_cond'].value_counts()
for i in np.sort(temp['weather_road_cond'].unique()):
    trace = go.Bar(x = temp.DrivingStyle[temp.weather_road_cond==i],
                   y = temp.Count[temp.weather_road_cond==i],
                   text = temp.Count[temp.weather_road_cond==i],
                   textposition = 'auto',
                   name = str(i))
    plot_data.append(trace)
layout = go.Layout(title = 'Driving Style vs Road Condition',
yaxis = dict(title='Count'),
xaxis = dict(title='Driving Style'),
barmode='stack')
fig = go.Figure(data=plot_data, layout=layout)
iplot(fig)
From the above distribution we learn that aggressive drivers also drive rashly in wet road conditions.
filter3=data.loc[data['DrivingStyle'] == 1]
# height replaces the size argument, which was renamed in newer seaborn versions
sns.lmplot(x='realtive_humidity', y='air_temp', hue='DrivingStyle',aspect=2,
           data=data[['realtive_humidity','air_temp','DrivingStyle']],fit_reg=False,height=9)
there are 2 stories here:
cmap = sns.cubehelix_palette(light=1, as_cmap=True)
# newer seaborn versions expect x= and y= keywords and fill= instead of shade=
sns.kdeplot(x=data['vehicle_speed'], y=data['pvehicle_speed_kph'], cmap=cmap, fill=True);
# sns.lmplot(x='vehicle_speed', y='pvehicle_speed_kph',aspect=2,
# data=data[['vehicle_speed','pvehicle_speed_kph']],fit_reg=False,size=9)
The above plot tells us that the speeds of the vehicle and of the preceding vehicle mostly lie in the range 70-100. The drivers are on a highway, which has a speed limit; in this case the limit appears to be 100.
# height replaces the size argument, which was renamed in newer seaborn versions
sns.lmplot(x='vehicle_speed', y='pvehicle_speed_kph',hue='lane_no',aspect=2,
           data=data[['vehicle_speed','pvehicle_speed_kph','lane_no']],fit_reg=False,height=9)
The drivers in lane 1 drive faster than those in lane 2, and there are more vehicles in lane 1 than in lane 2.
plot_data = []
temp = data.groupby(['trip_hour','DrivingStyle']).size().to_frame()
temp = temp.reset_index()
temp.columns = ['trip_hour','DrivingStyle','Count']
# temp
temp['DrivingStyle'].value_counts()
for i in np.sort(temp['DrivingStyle'].unique()):
    trace = go.Bar(x = temp.trip_hour[temp.DrivingStyle==i],
                   y = temp.Count[temp.DrivingStyle==i],
                   text = temp.Count[temp.DrivingStyle==i],
                   textposition = 'auto',
                   # randint(N) returns 0..N-1, always a valid index into c
                   marker = dict(color=c[np.random.randint(N)]),
                   name = str(i))
    plot_data.append(trace)
layout = go.Layout(title = 'Hour vs Driving Style',
                   yaxis = dict(title='Count'),
                   xaxis = dict(title='Trip Hour'),
                   barmode='stack')
fig = go.Figure(data=plot_data, layout=layout)
iplot(fig)
It is evident from the above graph:
plot_data = []
temp = data.groupby(['daylight_cond','DrivingStyle']).size().to_frame()
temp = temp.reset_index()
temp.columns = ['daylight_cond','DrivingStyle','Count']
# temp
temp['DrivingStyle'].value_counts()
for i in np.sort(temp['DrivingStyle'].unique()):
    trace = go.Bar(x = temp.daylight_cond[temp.DrivingStyle==i],
                   y = temp.Count[temp.DrivingStyle==i],
                   text = temp.Count[temp.DrivingStyle==i],
                   textposition = 'auto',
                   # randint(N) returns 0..N-1, always a valid index into c
                   marker = dict(color=c[np.random.randint(N)]),
                   name = str(i))
    plot_data.append(trace)
layout = go.Layout(title = 'Day Light Condition vs Driving Style',
                   yaxis = dict(title='Count'),
                   xaxis = dict(title='Day Light Condition'),
                   barmode='stack')
fig = go.Figure(data=plot_data, layout=layout)
iplot(fig)
This suggests that drivers tend to drive aggressively in particular daylight conditions.
plot_data = []
temp = data.groupby(['daylight_cond','trip_hour']).size().to_frame()
temp = temp.reset_index()
temp.columns = ['daylight_cond','trip_hour','Count']
# temp
temp['daylight_cond'].value_counts()
for i in np.sort(temp['daylight_cond'].unique()):
    trace = go.Bar(x = temp.trip_hour[temp.daylight_cond==i],
                   y = temp.Count[temp.daylight_cond==i],
                   text = temp.Count[temp.daylight_cond==i],
                   textposition = 'auto',
                   # randint(N) returns 0..N-1, always a valid index into c
                   marker = dict(color=c[np.random.randint(N)]),
                   name = str(i))
    plot_data.append(trace)
layout = go.Layout(title = 'Trip Hour vs Day Light Condition',
                   yaxis = dict(title='Count'),
                   xaxis = dict(title='Trip Hour'),
                   barmode='stack')
fig = go.Figure(data=plot_data, layout=layout)
iplot(fig)
With the above data and the information from this website (https://www.accuweather.com/en/weather-news/five-world-capitals-shortest-daylight/41734413) we can say that the country probably lies in the northern hemisphere.
plot_data = []
# for i in np.sort(data.DrivingStyle.unique()):
# randint(N) returns 0..N-1, always a valid index into the palette c
trace1 = go.Box(y = data.vehicle_weight_kg[data.DrivingStyle==1],
                marker = dict(color=c[np.random.randint(N)],outliercolor = 'rgba(219, 64, 82, 0.6)'),
                name = "Aggressive")
trace2 = go.Box(y = data.vehicle_weight_kg[data.DrivingStyle==2],
                marker = dict(color=c[np.random.randint(N)],outliercolor = 'rgba(219, 64, 82, 0.6)'),
                name = "Normal")
trace3 = go.Box(y = data.vehicle_weight_kg[data.DrivingStyle==3],
                marker = dict(color=c[np.random.randint(N)],outliercolor = 'rgba(219, 64, 82, 0.6)'),
                name = "Vague")
plot_data = [trace1,trace2,trace3]
layout = go.Layout(
autosize=True, # auto size the graph? use False if you are specifying the height and width
# width=1000, # width of the figure in pixels
# height=600, # height of the figure in pixels
title = "Boxplot of {} column based on {} ".format('Vehicle_weight','Driving Style'), # title of the figure
# more granular control on the title font
titlefont=dict(
family='Courier New, monospace', # font family
size=14, # size of the font
color='black' # color of the font
),
# granular control on the axes objects
xaxis=dict(
title='Driving Style',
tickfont=dict(
family='Courier New, monospace', # font family
size=14, # size of ticks displayed on the x axis
color='black' # color of the font
)
),
yaxis=dict(
title='Weight of Vehicle',
titlefont=dict(
size=14,
color='black'
),
tickfont=dict(
family='Courier New, monospace', # font family
size=14, # size of ticks displayed on the y axis
color='black' # color of the font
)
),
)
fig = go.Figure(data=plot_data, layout=layout)
iplot(fig)
# weight doesn't matter
# outliers misclassified as aggressive (ceo salary)
From the above graph we can say that a few drivers of heavier vehicles also drive aggressively.
plot_data = []
# for i in np.sort(data.DrivingStyle.unique()):
# randint(N) returns 0..N-1, always a valid index into the palette c
trace1 = go.Box(y = data.speed_difference[data.DrivingStyle==1],
                marker = dict(color=c[np.random.randint(N)],outliercolor = 'rgb(9,56,125)'),
                name = "Aggressive")
trace2 = go.Box(y = data.speed_difference[data.DrivingStyle==2],
                marker = dict(color=c[np.random.randint(N)],outliercolor = 'rgb(9,56,125)'),
                name = "Normal")
trace3 = go.Box(y = data.speed_difference[data.DrivingStyle==3],
                marker = dict(color=c[np.random.randint(N)],outliercolor = 'rgb(9,56,125)'),
                name = "Vague")
plot_data = [trace1,trace2,trace3]
layout = go.Layout(
autosize=True, # auto size the graph? use False if you are specifying the height and width
# width=1000, # width of the figure in pixels
# height=600, # height of the figure in pixels
title = "Boxplot of {} column based on {} ".format('Speed Difference','Driving Style'), # title of the figure
# more granular control on the title font
titlefont=dict(
family='Courier New, monospace', # font family
size=14, # size of the font
color='black' # color of the font
),
# granular control on the axes objects
xaxis=dict(
title='Driving Style',
tickfont=dict(
family='Courier New, monospace', # font family
size=14, # size of ticks displayed on the x axis
color='black' # color of the font
)
),
yaxis=dict(
title='Speed difference',
titlefont=dict(
size=14,
color='black'
),
tickfont=dict(
family='Courier New, monospace', # font family
size=14, # size of ticks displayed on the y axis
color='black' # color of the font
)
),
)
fig = go.Figure(data=plot_data, layout=layout)
iplot(fig)
The distance between the vehicle and the preceding vehicle is normally distributed, as the speeds of the vehicle and of its preceding vehicle are also normally distributed.
plot_data = []
temp = data.groupby(['DrivingStyle','binned_distance']).size().to_frame()
temp = temp.reset_index()
temp.columns = ['DrivingStyle','binned_distance','Count']
# temp
temp['binned_distance'].value_counts()
for i in np.sort(temp['binned_distance'].unique()):
    trace = go.Bar(x = temp.DrivingStyle[temp.binned_distance==i],
                   y = temp.Count[temp.binned_distance==i],
                   text = temp.Count[temp.binned_distance==i],
                   textposition = 'auto',
                   # randint(N) returns 0..N-1, always a valid index into c
                   marker = dict(color=c[np.random.randint(N)]),
                   name = str(i))
    plot_data.append(trace)
layout = go.Layout(title = 'Driving Style vs binned_distance',
yaxis = dict(title='Count'),
xaxis = dict(title='Driving Style'),
barmode='stack')
fig = go.Figure(data=plot_data, layout=layout)
iplot(fig)
The distance between a vehicle and its preceding vehicle is small. Driving rules on highways could be made stricter here.
data.number_of_axles
plot_data = []
temp = data.groupby(['number_of_axles','binned_distance']).size().to_frame()
temp = temp.reset_index()
temp.columns = ['number_of_axles','binned_distance','Count']
# temp
temp['binned_distance'].value_counts()
for i in np.sort(temp['binned_distance'].unique()):
    trace = go.Bar(x = temp.number_of_axles[temp.binned_distance==i],
                   y = temp.Count[temp.binned_distance==i],
                   text = temp.Count[temp.binned_distance==i],
                   textposition = 'auto',
                   # randint(N) returns 0..N-1, always a valid index into c
                   marker = dict(color=c[np.random.randint(N)]),
                   name = str(i))
    plot_data.append(trace)
layout = go.Layout(title = 'Number of axles vs binned_distance',
                   yaxis = dict(title='Count'),
                   xaxis = dict(title='Number of axles'),
                   barmode='stack')
fig = go.Figure(data=plot_data, layout=layout)
iplot(fig)
Clearly most of the data is for 2-axle vehicles, and we have much less data for vehicles with other axle counts. In future we can collect more data for vehicles with different numbers of axles to do a more in-depth study.
data.number_of_axles
plot_data = []
temp = data.groupby(['trip_month','DrivingStyle']).size().to_frame()
temp = temp.reset_index()
temp.columns = ['trip_month','DrivingStyle','Count']
# temp
temp['trip_month'].value_counts()
for i in np.sort(temp['DrivingStyle'].unique()):
    trace = go.Bar(x = temp.trip_month[temp.DrivingStyle==i],
                   y = temp.Count[temp.DrivingStyle==i],
                   text = temp.Count[temp.DrivingStyle==i],
                   textposition = 'auto',
                   # randint(N) returns 0..N-1, always a valid index into c
                   marker = dict(color=c[np.random.randint(N)]),
                   name = str(i))
    plot_data.append(trace)
layout = go.Layout(title = 'Trip Month vs Driving Style',
yaxis = dict(title='Count'),
xaxis = dict(title='Trip Month'),
barmode='stack')
fig = go.Figure(data=plot_data, layout=layout)
iplot(fig)
The graph has no data points for months 6, 7, 8, 9 and 10, as we are concerned with aggressiveness in bad weather conditions.